# Multimodal instruction understanding
**Mistral Small 3.2 24B Instruct 2506 GGUF** · unsloth · Apache-2.0

Mistral-Small-3.2-24B-Instruct-2506 is an image-text-to-text model, provided here as quantized GGUF builds, that shows significant improvements in instruction following, reduced repetition errors, and more reliable function calling.

Tags: Image-to-Text · Supports Multiple Languages
Downloads: 8,640 · Likes: 32
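
A minimal sketch of chatting with a quantized GGUF build via llama-cpp-python; the repo id and filename glob are assumptions to adjust to the files actually published, and image input (which needs the model's multimodal projector) is not shown.

```python
# Text-chat sketch with llama-cpp-python (pip install llama-cpp-python).
# Repo id and filename glob are placeholders; pick a real quantization from
# the repository's file list.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF",  # assumed repo id
    filename="*Q4_K_M.gguf",  # glob matching a 4-bit quantized file
    n_ctx=8192,               # context window to allocate
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize GGUF quantization in two sentences."}]
)
print(out["choices"][0]["message"]["content"])
```
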
**Ultravox V0 5 Llama 3 1 8b** · FriendliAI · MIT

A multilingual audio-to-text model built on Llama-3.1-8B-Instruct that handles speech input in over 40 languages.

Tags: Large Language Model · Transformers · Supports Multiple Languages
Downloads: 218 · Likes: 0
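
A sketch of audio-to-text inference following the pipeline pattern documented for Ultravox checkpoints; the checkpoint id and audio path are assumptions.

```python
# Audio-to-text sketch using a Transformers pipeline with Ultravox's custom
# model code; checkpoint id and audio file are placeholders.
import librosa
import transformers

pipe = transformers.pipeline(
    model="fixie-ai/ultravox-v0_5-llama-3_1-8b",  # assumed upstream checkpoint id
    trust_remote_code=True,                       # Ultravox ships custom model code
)

audio, sr = librosa.load("question.wav", sr=16000)  # placeholder recording
turns = [{"role": "system", "content": "You are a helpful multilingual assistant."}]
print(pipe({"audio": audio, "turns": turns, "sampling_rate": sr}, max_new_tokens=64))
```
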
**Qwen2.5 VL 32B Instruct GGUF** · Mungert · Apache-2.0

Qwen2.5-VL-32B-Instruct is a 32B-parameter multimodal vision-language model that supports joint image-text understanding and text generation; this repository provides quantized GGUF builds.

Tags: Image-to-Text · English
Downloads: 9,766 · Likes: 6
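
For large GGUF repositories it is usually better to fetch a single quantization than to clone everything; a minimal sketch with huggingface_hub, where the repo id and filename are illustrative.

```python
# List available quantizations, then download one GGUF file.
from huggingface_hub import hf_hub_download, list_repo_files

repo_id = "Mungert/Qwen2.5-VL-32B-Instruct-GGUF"  # assumed repo id
for name in list_repo_files(repo_id):
    print(name)  # inspect the published quantization levels (Q4_K_M, Q8_0, ...)

path = hf_hub_download(
    repo_id=repo_id,
    filename="Qwen2.5-VL-32B-Instruct-Q4_K_M.gguf",  # placeholder filename
)
print("Saved to", path)
```
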
**Qwen2 VL 2B Instruct** · FriendliAI · Apache-2.0

Qwen2-VL-2B-Instruct is a multimodal vision-language model that supports image-text-to-text tasks.

Tags: Image-to-Text · Transformers · English
Downloads: 24 · Likes: 1
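
A minimal image-text-to-text sketch with Transformers, assuming the upstream Qwen checkpoint id and a placeholder image URL.

```python
# Describe an image with Qwen2-VL via Transformers; the checkpoint id and
# image URL are placeholders.
import requests
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-2B-Instruct"  # assumed upstream checkpoint id
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image in one sentence."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=64)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```
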
**Instructclip InstructPix2Pix** · SherryXTChen · Apache-2.0

InstructCLIP is an instruction-guided image editing model improved through contrastive-learning-based automatic data optimization. It combines CLIP and Stable Diffusion to edit images based on textual instructions.

Tags: Text-to-Image · English
Downloads: 450 · Likes: 5
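
A sketch of instruction-guided editing through the diffusers InstructPix2Pix pipeline this model builds on; the checkpoint id, input image, and instruction are assumptions.

```python
# Edit an image from a text instruction; requires a CUDA GPU as written.
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "SherryXTChen/InstructCLIP-InstructPix2Pix",  # assumed checkpoint id
    torch_dtype=torch.float16,
).to("cuda")

source = Image.open("room.png").convert("RGB")  # placeholder input image
edited = pipe(
    "make the walls light blue",  # the textual edit instruction
    image=source,
    num_inference_steps=20,
    image_guidance_scale=1.5,     # how closely to stick to the source image
).images[0]
edited.save("room_edited.png")
```
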
**Phi 4 Multimodal Instruct** · Robeeeeeeeeeee · MIT

Phi-4-multimodal-instruct is a lightweight open-source multimodal foundation model that integrates the language, vision, and speech research and datasets behind the Phi-3.5 and Phi-4.0 models. It accepts text, image, and audio inputs, generates text outputs, and supports a context length of 128K tokens.

Tags: Multimodal Fusion · Transformers · Supports Multiple Languages
Downloads: 21 · Likes: 1
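
A sketch of image-plus-text inference through Transformers remote code, following the prompt conventions published for Phi multimodal checkpoints; the checkpoint id, tag format, and image path are assumptions.

```python
# Ask Phi-4-multimodal about an image; placeholders throughout.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"  # assumed upstream checkpoint id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

image = Image.open("chart.png")  # placeholder image
prompt = "<|user|><|image_1|>What does this chart show?<|end|><|assistant|>"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```
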
**Phi 4 Multimodal Instruct Onnx** · microsoft · MIT

ONNX version of the Phi-4 multimodal model, quantized to int4 precision with accelerated inference via ONNX Runtime; it supports text, image, and audio inputs.

Tags: Multimodal Fusion · Other
Downloads: 159 · Likes: 66
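
A text-only generation sketch with onnxruntime-genai against the int4 weights; the local folder is a placeholder, the generator API has shifted between releases, and image/audio inputs (which go through the package's multimodal processor) are omitted.

```python
# Token-streaming generation loop with onnxruntime-genai; treat this as a
# pattern rather than a pinned recipe, since API details vary by release.
import onnxruntime_genai as og

model = og.Model("./phi-4-multimodal-onnx-int4")  # placeholder local folder
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Explain int4 quantization briefly."))
while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```
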
**Llama 3.2 11B Vision Instruct GGUF** · pbatra

Llama-3.2-11B-Vision-Instruct is a multilingual vision-language model for image-text-to-text tasks, provided here as GGUF builds (loadable with the llama-cpp-python pattern shown above, subject to the runtime supporting this architecture).

Tags: Image-to-Text · Transformers · Supports Multiple Languages
Downloads: 172 · Likes: 1

**Taivisionlm Base V2** · benchang1110

A 1.2B-parameter vision-language model, the first to support instruction input in Traditional Chinese; it is compatible with the Transformers library, quick to load, and easy to fine-tune.

Tags: Image-to-Text · Transformers · Chinese
Downloads: 122 · Likes: 4
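
A loading sketch for a Transformers-compatible VLM that ships custom code; the repo id, model class, and prompt are assumptions, so check the model card for the exact inputs its processor expects.

```python
# Generic trust_remote_code load for a custom VLM; placeholders throughout.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "benchang1110/TaiVisionLM-base-v2"  # assumed repo id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("photo.jpg")  # placeholder image
# Traditional Chinese instruction: "Describe this picture."
inputs = processor(text="請描述這張圖片。", images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```
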
**Octo Small 1.5** · rail-berkeley · MIT

Octo Small is a Transformer-based diffusion policy model for robot control that predicts robot actions from visual inputs and language instructions.

Tags: Multimodal Fusion · Transformers
Downloads: 250 · Likes: 6
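
An action-sampling sketch following the pattern in the Octo README; the checkpoint path, observation keys and shapes, and task string are illustrative, and a real deployment must match the observation format the checkpoint was trained with.

```python
# Sample an action chunk from a language-conditioned Octo policy (JAX-based).
import jax
import numpy as np
from octo.model.octo_model import OctoModel

model = OctoModel.load_pretrained("hf://rail-berkeley/octo-small-1.5")  # assumed path

# Dummy single-frame camera observation: (batch=1, history=1, H, W, C).
observation = {
    "image_primary": np.zeros((1, 1, 256, 256, 3), dtype=np.uint8),
    "timestep_pad_mask": np.ones((1, 1), dtype=bool),
}
task = model.create_tasks(texts=["pick up the red block"])  # language goal
actions = model.sample_actions(observation, task, rng=jax.random.PRNGKey(0))
print(actions.shape)  # predicted (normalized) action chunk
```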